EPJ manuscript No. 

(will be inserted by the editor) 






Spectral properties of Google matrix of Wikipedia and other 
networks 

L. Ermann 1,2, K.M. Frahm 2, and D.L. Shepelyansky 2

1 Departamento de Física Teórica, GIyA, Comisión Nacional de Energía Atómica, Buenos Aires, Argentina

2 Laboratoire de Physique Théorique du CNRS, IRSAMC, Université de Toulouse, UPS, 31062 Toulouse, France

Received: December 5, 2012 

Abstract. We study the properties of eigenvalues and eigenvectors of the Google matrix of the Wikipedia 
articles hyperlink network and other real networks. With the help of the Arnoldi method we analyze the 
distribution of eigenvalues in the complex plane and show that eigenstates with significant eigenvalue 
modulus are located on well defined network communities. We also show that the correlator between 
PageRank and CheiRank vectors distinguishes different organizations of information flow on BBC and Le 
Monde web sites. 



PACS. 89.75.Fb Structures and organization in complex systems - 89.75.Hc Networks and genealogical
trees - 89.20.Hh World Wide Web, Internet






1 Introduction 

With the appearance of the World Wide Web (WWW) [1 
the modern society created huge directed networks where 
the information retrieval and ranking of network nodes be- 
comes a formidable challenge. The mathematical grounds 
of ranking of nodes are based on the concept of Markov
chains [5] and the related class of Perron-Frobenius operators
naturally appearing in dynamical systems (see e.g. [3]). A
concrete implementation of these mathematical concepts 
to the ranking of WWW nodes was started by Brin and 
Page in 1998 [4]. It is largely based on the PageRank
algorithm (PRA), which became a fundamental element of
the Google search engine broadly used by Internet users.

Already in 1998 Brin and Page pointed out that "de- 
spite the importance of large-scale search engines on the 
web, very little academic research has been done on them" 
[4]. Since that time the academic studies have been con- 
centrated mainly on the properties of the PageRank vector 
determined by the PRA (see e.g. [5] , 015] , [5] ) . Of course, 
the PageRank vector is at the basis of ranking of network 
nodes but the whole description of a directed network is 
given by the Google matrix G. Thus it is important to 
understand the properties of the whole spectrum of eigen- 
values of Google matrix and to analyze the meaning and 
significance of its eigenstates. Certain spectral properties 
of G matrix have been analyzed in [5] , QMI] , [I1[T3],[I1 
I15j . Here, we concentrate our spectral analysis on the 
Wikipedia articles network studied in [16]. The advantage
of this network is the clear meaning of its nodes, determined
by the titles of Wikipedia articles, which simplifies
the understanding of information flow in this network.
In addition to that we analyze the statistical properties 
of eigenvalues and eigenstates of G for WWW networks 
of Cambridge University, Python, BBC and Le Monde 
crawled in March 2011. 

The Google matrix elements of a directed network are 
defined as [1MI7] 



$$G_{ij} = \alpha S_{ij} + (1-\alpha)/N \qquad (1)$$



where the matrix $S_{ij}$ is obtained from the adjacency matrix
$A_{ij}$ by normalizing all nonzero columns to one ($\sum_i S_{ij} = 1$)
and replacing columns with only zero elements (dangling nodes)
by $1/N$, with $N$ being the matrix size. For the
WWW an element $A_{ij}$ of the adjacency matrix is equal
to unity if node $j$ points to node $i$ and zero otherwise.
The damping parameter $\alpha$ in the WWW context
describes the probability $(1-\alpha)$ for a random surfer to jump
to any node. For the WWW the Google search engine uses
$\alpha \approx 0.85$ [5]. The matrix $G$ belongs to the class of
Perron-Frobenius operators [5]; its largest eigenvalue is $\lambda = 1$ and
the other eigenvalues have $|\lambda| \le \alpha$. The right eigenvector at
$\lambda = 1$, which is called the PageRank, has real nonnegative
elements $P(i)$ and gives the probability $P(i)$ to find
a random surfer at site $i$. Due to the gap $1-\alpha \approx 0.15$
between the largest eigenvalue and the other eigenvalues
the PRA permits an efficient and simple determination of
the PageRank by the power iteration method. Note that
at $\alpha = 1$ the largest eigenvalue $\lambda = 1$ is typically highly
degenerate due to many invariant subspaces, which define
many independent Perron-Frobenius operators, each
providing (at least) one eigenvalue $\lambda = 1$. This point and
also a numerical method to determine the PageRank for
the case $1-\alpha \ll 1$ are described in detail in [13].
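As an illustration of Eq. (1) and of the power iteration mentioned above, the following is a minimal dense-matrix sketch in Python with NumPy. The function names `google_matrix` and `pagerank`, the tolerance and the 4-node toy network are assumptions made for this example only; real networks of the size considered here require sparse storage.

```python
import numpy as np

def google_matrix(A, alpha=0.85):
    """Build G = alpha*S + (1-alpha)/N from an adjacency matrix A,
    with the convention A[i, j] = 1 if node j points to node i."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    col_sums = A.sum(axis=0)
    dangling = col_sums == 0
    S = np.empty_like(A)
    # normalize nonzero columns to unit sum
    S[:, ~dangling] = A[:, ~dangling] / col_sums[~dangling]
    # dangling nodes: replace the zero column by 1/N
    S[:, dangling] = 1.0 / N
    return alpha * S + (1.0 - alpha) / N

def pagerank(G, tol=1e-12, max_iter=1000):
    """Power iteration for the right eigenvector of G at eigenvalue 1."""
    N = G.shape[0]
    P = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        P_next = G @ P
        P_next /= P_next.sum()
        if np.abs(P_next - P).sum() < tol:
            return P_next
        P = P_next
    return P

# hypothetical toy network: links 0->1, 0->2, 1->2, 2->0; node 3 is dangling
A = np.zeros((4, 4))
for src, dst in [(0, 1), (0, 2), (1, 2), (2, 0)]:
    A[dst, src] = 1.0
G = google_matrix(A)
P = pagerank(G)
```

Because the gap between the unit eigenvalue and the rest of the spectrum is at least $1-\alpha$, the error of the iteration shrinks roughly by a factor $\alpha$ per step.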

Once the PageRank (at $\alpha = 0.85$) is found, all nodes
can be sorted by decreasing probabilities $P(i)$. The node
rank is then given by the index $K(i)$, which reflects the
relevance of the node $i$. The top PageRank nodes are located
at small values of $K(i) = 1, 2, \ldots$.

In addition to a given directed network $A_{ij}$ it is useful
to analyze the inverse network, with inverted direction
of links and adjacency matrix elements $A_{ij} \to A_{ji}$.
The Google matrix $G^*$ of the inverse network is then constructed
via the corresponding matrix $S^*$ according to
relation (1), using the same value of $\alpha$ as for the $G$ matrix.
The right eigenvector of $G^*$ at eigenvalue $\lambda = 1$ is called
the CheiRank, giving a complementary rank index $K^*(i)$ of
network nodes [T51HB] . [151115] . . It is known that the 
PageRank probability is proportional to the number of in- 
going links characterizing how popular or known a given 
node is while the CheiRank probability is proportional 
to the number of outgoing links highlighting the node 
communicativity (see e.g. IMS], [7115] . [T6UT9] ) . The statis- 
tical properties of the node distribution on the PageRank- 
CheiRank plane are described in [19] for various directed
networks. 
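The construction of the CheiRank from the inverse network, and of the rank indexes $K$ and $K^*$, can be sketched as follows. This is an illustrative example only: the helper names `stationary_vector` and `rank_index`, the fixed number of iterations and the 5-node toy network are assumptions, not part of the original study.

```python
import numpy as np

def stationary_vector(A, alpha=0.85, n_iter=500):
    """Power iteration for G = alpha*S + (1-alpha)/N built from the
    adjacency matrix A (convention: A[i, j] = 1 if j points to i)."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    col = A.sum(axis=0)
    # normalized columns; dangling columns are replaced by 1/N
    S = np.where(col > 0, A / np.where(col > 0, col, 1.0), 1.0 / N)
    G = alpha * S + (1.0 - alpha) / N
    v = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        v = G @ v
        v /= v.sum()
    return v

def rank_index(prob):
    """Return K with K[i] = rank of node i (1 = largest probability)."""
    order = np.argsort(-prob, kind="stable")
    K = np.empty_like(order)
    K[order] = np.arange(1, len(prob) + 1)
    return K

# hypothetical toy network: 0 -> {1, 2, 3}, {1, 2, 3} -> 4, 4 -> 0
A = np.zeros((5, 5))
for src, dst in [(0, 1), (0, 2), (0, 3), (1, 4), (2, 4), (3, 4), (4, 0)]:
    A[dst, src] = 1.0

P = stationary_vector(A)         # PageRank of the network
P_star = stationary_vector(A.T)  # CheiRank: same procedure on inverted links
K, K_star = rank_index(P), rank_index(P_star)
```

In this toy example node 4 collects the most ingoing links and is at the top of PageRank, while node 0 has the most outgoing links and is at the top of CheiRank.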

The paper is organized as follows: the spectrum of
the Google matrix of various networks is analyzed in Section 2,
statistical properties of eigenstates are discussed
in Section 3, the communities related to Wikipedia eigenstates
are examined in Section 4, the distribution of nodes
in the PageRank-CheiRank plane is studied in Section 5,
the link distribution over the PageRank index is considered in
Section 6, a discussion of the results is given in Section 7, and
acknowledgments are given in Section 8. The Appendix (Section 9)
gives all parameters of the 5 directed networks considered
here and describes in detail certain eigenvalues and
eigenvectors.



2 Google matrix spectrum 

We study the spectrum of eigenvalues of the Google matrix
of 5 directed networks. For each network the number
of nodes $N$ and the number of links $N_\ell$ are given in Table 1
(see Appendix). The spectrum is obtained numerically
using the powerful Arnoldi method described in [2U
[22], [23]. The idea of the method is to construct a set of
orthonormal vectors by applying the matrix ($G$, $S$, $G^*$,
$S^*$ or any other matrix of which we want to determine
the largest eigenvalues) to some suitable normalized initial
vector and orthonormalizing the result to the initial
vector. Then the matrix is applied to the second vector,
the result is orthonormalized to the first two vectors,
and so on. The scalar products and normalization
factors used during the Gram-Schmidt process provide the
matrix representation of the initial big matrix on the set of
orthonormal vectors (which span a Krylov space) in the form
of a Hessenberg matrix. Its eigenvalues typically converge
quite well to the largest eigenvalues of the initial
matrix even if the chosen number of orthonormal vectors,
the Arnoldi dimension $n_A$, is quite modest (3000-5000 in
this work) compared to the initial matrix size.
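The Arnoldi procedure described above can be sketched in a few lines of Python. This is a minimal illustration assuming a dense NumPy matrix-vector product and a small random column-stochastic test matrix; the function name `arnoldi_eigenvalues`, the breakdown threshold and the test matrix are assumptions of this example and not the implementation used for the results of this paper.

```python
import numpy as np

def arnoldi_eigenvalues(matvec, v0, n_A):
    """Run n_A Arnoldi steps and return the eigenvalues of the resulting
    Hessenberg matrix (approximations to the largest eigenvalues)."""
    N = v0.shape[0]
    Q = np.zeros((N, n_A + 1))   # orthonormal Krylov vectors
    H = np.zeros((n_A + 1, n_A)) # Hessenberg representation
    Q[:, 0] = v0 / np.linalg.norm(v0)
    m = n_A
    for k in range(n_A):
        w = matvec(Q[:, k])
        for j in range(k + 1):   # Gram-Schmidt against previous vectors
            H[j, k] = Q[:, j] @ w
            w = w - H[j, k] * Q[:, j]
        H[k + 1, k] = np.linalg.norm(w)
        if H[k + 1, k] < 1e-12:  # Krylov space is exhausted
            m = k + 1
            break
        Q[:, k + 1] = w / H[k + 1, k]
    return np.linalg.eigvals(H[:m, :m])

# hypothetical test: random column-stochastic matrix of size 60
rng = np.random.default_rng(0)
M = rng.random((60, 60))
M /= M.sum(axis=0)
ritz = arnoldi_eigenvalues(lambda v: M @ v, rng.random(60), n_A=30)
lam_max = ritz[np.argmax(np.abs(ritz))]
```

Only matrix-vector products are needed, so for a sparse network matrix the cost per step is proportional to the number of links plus the cost of the orthonormalization.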

In this work we are interested in the spectrum of the
matrix $S = G(\alpha = 1)$ (or $S^*$), since the spectrum of $G(\alpha)$
(or $G^*(\alpha)$) is simply obtained by rescaling the complex
eigenvalues by the factor $\alpha$ (apart from one largest
eigenvalue $\lambda = 1$, which does not change).
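The rescaling property can be checked numerically on a small example. The sketch below assumes a random column-stochastic matrix $S$ with a non-degenerate unit eigenvalue; the size, seed and variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha = 6, 0.85

# hypothetical random column-stochastic matrix S (every column sums to one)
S = rng.random((N, N))
S /= S.sum(axis=0)
G = alpha * S + (1.0 - alpha) / N

eig_S = np.linalg.eigvals(S)
eig_G = np.linalg.eigvals(G)

# expected spectrum of G: the unit eigenvalue is kept, all others are scaled by alpha
expected = alpha * eig_S
expected[np.argmax(np.abs(eig_S))] = 1.0

# distance from each expected eigenvalue to the closest eigenvalue of G
mismatch = max(np.min(np.abs(eig_G - e)) for e in expected)
```

The reason is that any right eigenvector of $S$ with $\lambda \neq 1$ has zero sum of components, so the uniform term $(1-\alpha)/N$ in Eq. (1) does not act on it.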

However, the highly degenerate unit eigenvalue $\lambda = 1$
of $S$ creates convergence problems for the Arnoldi method
and therefore, as in [13,15], we first find the invariant isolated
subsets. These subsets are invariant with respect to
applications of $S$. We merge all subspaces with common
members, and obtain a sequence of disjoint subspaces $V_j$
of dimension $d_j$ invariant under applications of $S$. The
remaining nodes form the wholly connected core
space. Such a classification scheme can be efficiently implemented
in a computer program and it provides a subdivision
of network nodes into $N_c$ core space nodes and $N_s$
subspace nodes belonging to at least one of the invariant
subspaces $V_j$, inducing the block triangular structure of
the matrix $S$:

$$S = \begin{pmatrix} S_{ss} & S_{sc} \\ 0 & S_{cc} \end{pmatrix} \qquad (2)$$

where $S_{ss}$ is itself composed of many small diagonal blocks,
one for each invariant subspace, whose eigenvalues can be
efficiently obtained by direct ("exact") numerical diagonalization.
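One possible way to implement such a classification is sketched below. It is an assumption-laden illustration, not the procedure of [13]: starting from each node, the set of nodes reachable along outgoing links is collected; if this set stays closed and does not exceed a size threshold it is taken as an invariant subspace, otherwise the starting node is assigned to the core space. The function name, the adjacency-list input and the threshold `max_size` are hypothetical.

```python
from collections import deque

def split_core_and_subspaces(out_links, max_size):
    """out_links[j] lists the nodes that node j points to.
    Returns (list of disjoint invariant subspaces, list of core nodes)."""
    n = len(out_links)
    closed_sets = []
    for start in range(n):
        seen = {start}
        queue = deque([start])
        closed = True
        while queue and closed:
            j = queue.popleft()
            if not out_links[j]:
                closed = False      # dangling node: S couples it to all nodes
            for i in out_links[j]:
                if i not in seen:
                    seen.add(i)
                    queue.append(i)
            if len(seen) > max_size:
                closed = False      # reachable set too large: core space
        if closed:
            closed_sets.append(seen)
    # merge all invariant sets that share common members
    subspaces = []
    for s in closed_sets:
        for m in [m for m in subspaces if m & s]:
            s = s | m
            subspaces.remove(m)
        subspaces.append(s)
    in_subspace = set().union(*subspaces) if subspaces else set()
    core = [j for j in range(n) if j not in in_subspace]
    return subspaces, core

# hypothetical toy network: 0 <-> 1 is closed; 2 -> {0, 3} and 3 -> 2 form the core
subspaces, core = split_core_and_subspaces([[1], [0], [0, 3], [2]], max_size=2)
```

Core nodes may point into a subspace (node 2 points to node 0 here), but not the other way round, which is what produces the zero block in Eq. (2).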

The total subspace size $N_s$, the number of independent
subspaces $N_d$, the maximal subspace dimension $d_{\max}$ and
the number $N_1$ of $S$ eigenvalues with $\lambda = 1$ are given in
Table 2. The spectrum and eigenstates of the core space
$S_{cc}$ are determined by the Arnoldi method with Arnoldi
dimension $n_A$, giving the eigenvalues $\lambda_i$ of $S_{cc}$ with largest
modulus and the corresponding eigenvectors $\psi_i$ ($G\psi_i =
\lambda_i\psi_i$). The values of $n_A$ we used for the different networks
are given in Table 1. According to Table 1 we have the
average number of links per node $Q \approx 21.63$ (Wikipedia),
16.91 (Cambridge 2011), 16.67 (Python), 22.77 (BBC),
79.14 (Le Monde).

The distributions of subspace eigenvalues and of the $n_A$ largest
eigenvalues of the core space are shown in Fig. 1 in the
complex plane $\lambda$ for all 5 networks. The blue points show
the eigenvalues of isolated subspaces. We note that their
number is relatively small compared to those of British
University networks [23] (up to year 2006) analyzed in [13].
We attribute this to a larger number $Q$ of links per node,
which reduces the effective size of isolated parts of the network.
Between 2006 and 2011, especially for Cambridge, it seems
that the increased use of PHP and similar web software
tends to considerably increase the value of $Q$. Indeed, we
have $Q \sim 10$ for university networks up to 2006 [Tj5], which
made less use of this kind of software. In Fig. 1 the red points
show the $n_A$ eigenvalues of the core space with largest $|\lambda|$. Due
to the finite value of $n_A$ there is an empty white space around
$\lambda = 0$. There is no significant gap for core eigenvalues
since $\lambda_1$ is rather close to 1 (see Table 3).

Globally we can say that the structure of the Wikipedia
spectrum of $S$ and $S^*$ is somewhat similar to that of
Cambridge 2006 (see Fig. 2 in [13]).

Fig. 1. Spectrum of eigenvalues $\lambda$ of the Google matrices $G$
(left column) and $G^*$ (right column) for Wikipedia, Cambridge
2011, Python, BBC and Le Monde ($\alpha = 1$). Red dots are core
space eigenvalues, blue dots are subspace eigenvalues and the
full green curve shows the unit circle. The core space eigenvalues
were calculated by the projected Arnoldi method with
Arnoldi dimensions $n_A$ as given in Table 1.

For Cambridge 2011
the spectrum of $S$ is drastically changed compared to the
year 2006, but for $S^*$ certain features remain common to both
2006 and 2011 (e.g. a circle $|\lambda| \approx 0.5$, triplet-star). For
Python, BBC and Le Monde the imaginary parts Im($\lambda$)
of eigenvalues of $S$ are relatively small compared to the
networks of Wikipedia and Cambridge. We suppose that
there are fewer symmetric links in the latter cases. It is
interesting that for $S^*$ of Python, BBC and Le Monde the
imaginary parts Im($\lambda$) are significantly larger than for $S$.

The origin of the nontrivial structures of the spectrum of $G$
and $G^*$ for directed networks discussed here and in [TT1|12[
imilTlj] still requires detailed analysis. We note that well visible
triplet and cross structures (see e.g. the Wikipedia spectrum
in Fig. 1 and Fig. 2 of [13]) naturally appear in the
spectra of random unistochastic matrices of size $N = 3$
and 4, which have been analyzed analytically and numerically
in [25]. In view of this similarity we suppose that
networks with such structures have some triplet or quartet
subgroup of nodes weakly coupled to the rest of the
network. However, a detailed understanding of the spectrum
requires a deeper analysis. In the next Section we
turn to a study of eigenstate properties.



3 Statistical properties of eigenstates 

The dependence of the PageRank $P$ and CheiRank $P^*$ vectors
on their indexes $K$ and $K^*$ at $\alpha = 0.85$ and $\alpha = 1 - 10^{-8}$ is
shown in Fig. 2. At $\alpha = 0.85$ we have an approximate
algebraic decay of probability according to the Zipf law
$P \propto 1/K^{\beta}$, $P^* \propto 1/{K^*}^{\beta}$ (see e.g. [H] and Refs. therein).
We find the following values of $\beta$ for PageRank (CheiRank):
0.96 ± 0.002 (0.73 ± 0.003) Wikipedia; 0.81 ± 0.007 (0.90 ±
0.004) Cambridge 2011; 1.12 ± 0.01 (1.17 ± 0.006) Python;
1.20 ± 0.006 (0.96 ± 0.004) BBC; 1.08 ± 0.009 (0.55 ± 0.002)
Le Monde. Formally, the statistical errors in $\beta$ are relatively
small, but in some cases there are variations of slope
in the decay of PageRank (CheiRank) probability, which
make $\beta$ depend on the fitting range (this is why
$\beta$ here differs slightly from its values for Wikipedia given
in [IS]). We note that the value $\beta \approx 1$ for the PageRank
remains relatively stable for all networks, corresponding to
the usual exponent $\mu \approx 2.1$ of algebraic decay of the ingoing
link distribution, leading to $\beta = 1/(\mu - 1) \approx 0.9$ (see
e.g. nm, nanus]).

For CheiRank the variations of $\beta$ from one network to
another are more significant, in agreement with the
fact that for outgoing links the exponent $\mu \approx 2.7$ varies in
a more significant manner.
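The exponent $\beta$ can be estimated by a straight-line fit in log-log coordinates. The sketch below is only an illustration of such a fit on synthetic data; the function name `zipf_exponent`, the choice of the fitting range and the synthetic probability vector are assumptions, and the fitting procedure actually used for the values quoted above is not specified here.

```python
import numpy as np

def zipf_exponent(prob, k_min=1, k_max=None):
    """Least-squares estimate of beta in P ~ 1/K^beta over ranks k_min..k_max."""
    p = np.sort(np.asarray(prob, dtype=float))[::-1]  # decreasing probability
    K = np.arange(1, len(p) + 1, dtype=float)         # rank index
    sel = slice(k_min - 1, k_max)
    slope, _intercept = np.polyfit(np.log(K[sel]), np.log(p[sel]), 1)
    return -slope

# synthetic probability vector with an exact power law decay, beta = 0.9
K = np.arange(1, 10001, dtype=float)
P = K ** -0.9
P /= P.sum()
beta = zipf_exponent(P)

# relation between the link-distribution exponent mu and beta quoted in the text
mu = 2.1
beta_from_links = 1.0 / (mu - 1.0)
```

On real data the result depends on `k_min` and `k_max`, which is the fitting-range dependence mentioned above.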

For $\alpha = 1 - 10^{-8}$ we find that the main probability
of the PageRank and CheiRank eigenvectors is located on isolated
subspaces with $N_s$ nodes; beyond that value there is a
significant drop of probability for $K, K^* > N_s$. This effect
was already found and explained in detail in [13], and our
new data confirm that it is indeed rather generic.

The moduli of four eigenfunctions $|\psi_i(j)|$ from the core
space are shown in Fig. 2 by color curves as a function of
their own index $K_i$, which orders $|\psi_i(j)|$ in monotonically
decreasing order. For Python, BBC and Le Monde the decay
of $|\psi_i(j)|$ with $K_i$ is similar to the decay of the PageRank
probability with $K$. For Wikipedia and Cambridge 2011
we see that the eigenvectors $|\psi_i(j)|$ are more localized. The
eigenstates of $S^*$ have a significantly more irregular decay
compared to the eigenstates of $S$.

To analyze the properties of core eigenstates of
Wikipedia and Cambridge 2011 in more detail we select
200 core space eigenvalues of $S$ and $S^*$ closest to the
unit circle $|\lambda| = 1$. These eigenvalues are
shown in Fig. 3. For these eigenvalues we compute the
corresponding eigenvectors and, by fitting a power
law dependence $|\psi_i(K_i)| \propto K_i^{b}$ at $K_i > 10^4$, we determine
the dependence of the exponent $b$ on the phase of
the eigenvalue $\varphi = \arg(\lambda_i)$. For Wikipedia we have values
of $|b|$ distributed mainly in the range (1, 2) for $S$ and in
the range (0.5, 1.5) for $S^*$. For Cambridge 2011 we have a
more compact range (0.5, 1) for $S$ while for $S^*$ there is a
very broad variation of $|b|$ values in the range (1, 4).

Fig. 2. PageRank $P$ (left column) and CheiRank $P^*$ (right
column) vectors are shown as a function of the corresponding
rank indexes $K$ or $K^*$ for the Google matrices of Wikipedia,
Cambridge 2011, Python, BBC and Le Monde at the damping
parameter $\alpha = 0.85$ (thick black curve) and $\alpha = 1 - 10^{-8}$ (thick
gray curve). The thin color curves show for each panel the
modulus of four core space eigenvectors $|\psi_i|$ of $S$ (left column)
and $|\psi_i^*|$ of $S^*$ (right column) versus their ranking indexes $K_i$
or $K_i^*$. Red and green curves are the eigenvectors corresponding
to the two largest core space eigenvalues (in modulus), which
are real and close to 1; blue and pink curves are the eigenvectors
corresponding to two complex eigenvalues with large imaginary
part. The chosen eigenvalues and other relevant quantities for
each case are listed in Tables 1, 2, 3.

Fig. 3. A selection of 200 complex core space eigenvalues closest
to the unit circle for the matrices $S$ (left column) and $S^*$
(right column) of Wikipedia and Cambridge 2011 networks.
The characteristics of corresponding eigenvectors are shown in
Figs. 4, 5.

Fig. 4. Left column: Algebraic exponent $b$ obtained from a
power law fit $|\psi_i(K_i)| \propto K_i^{b}$ for $K_i > 10^4$, shown as a function
of the phase $\varphi = \arg(\lambda_i)$ of the complex eigenvalue $\lambda_i$
associated to the eigenvector $\psi_i$ of $S$. The shown data points
correspond to the eigenvalue selection of Fig. 3 for networks of
Wikipedia and Cambridge 2011. Right column: The same as in
the left column for the eigenvectors of $S^*$.

Fig. 5. Left column: Inverse participation ratio $\xi_{\rm IPR} =
(\sum_j |\psi_i(j)|^2)^2 / \sum_j |\psi_i(j)|^4$ shown as a function of the phase
$\varphi = \arg(\lambda_i)$ of the complex eigenvalue $\lambda_i$ associated to the
eigenvector $\psi_i$ of $S$. The data points correspond to the eigenvalue
selection of Fig. 3 for networks of Wikipedia and Cambridge
2011. Right column: The same as in the left column for
the eigenvectors of $S^*$.

The above approximate power law description of the
eigenstate decay characterizes their behavior at large $K$
values. The behavior at low $K$ values can be characterized
by the inverse participation ratio (IPR)
$\xi_{\rm IPR} = (\sum_j |\psi_i(j)|^2)^2 / \sum_j |\psi_i(j)|^4$, which gives an approximate
number of nodes on which the main probability of
an eigenstate $\psi_i(j)$ is located. We note that such a characteristic
is broadly used in disordered mesoscopic systems,
where it allows one to detect the Anderson transition from a localized
phase with finite $\xi$ to a delocalized phase with $\xi$ comparable
to the system size [33]. The IPR data are presented
in Fig. 5 for the eigenvalue selection of Fig. 3. We find
that the $\xi_{\rm IPR}$ values are smaller than the network size $N$
by a factor $10^4$ to $10^5$. This means that these eigenstates are
well localized on a restricted number of nodes. In the next
Section we try to identify these nodes for the example
of Wikipedia, where the meaning of a node is clearly
defined by the title of the corresponding Wikipedia article.
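The IPR defined above is straightforward to evaluate for a given eigenvector. The short sketch below uses two artificial vectors, chosen here purely for illustration, to show its meaning as an effective number of occupied nodes.

```python
import numpy as np

def inverse_participation_ratio(psi):
    """xi_IPR = (sum_j |psi(j)|^2)^2 / sum_j |psi(j)|^4."""
    w = np.abs(np.asarray(psi)) ** 2
    return w.sum() ** 2 / (w ** 2).sum()

# artificial examples: a fully delocalized vector on 1000 nodes ...
xi_uniform = inverse_participation_ratio(np.ones(1000))

# ... and a vector located uniformly on only 10 of the 1000 nodes
psi_localized = np.zeros(1000, dtype=complex)
psi_localized[:10] = 1.0 + 1.0j
xi_localized = inverse_participation_ratio(psi_localized)
```

The definition does not depend on the normalization of the eigenvector, so complex eigenvectors returned by an eigensolver can be used directly.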



4 Communities of Wikipedia eigenstates 

To understand the meaning of other eigenstates in the
core space we order the nodes of selected eigenstates by
decreasing value of $|\psi_i|$ and apply a frequency analysis to the first
1000 articles with $K_i \le 1000$. The most frequent word of
a given eigenvector is used as the label of the eigenvector.
These labels with the corresponding eigenvalues are shown in
Fig. 6 in the $\lambda$-plane. We identify four main categories for the
selected eigenvectors, shown by different colors in Fig. 6:
countries (red), biology and medicine (orange), mathematics
(blue) and others (green).

Fig. 6. Complex eigenvalue spectrum of the matrix $S$ for
Wikipedia. Highlighted eigenvalues represent different communities
of Wikipedia and are labeled by the most repeated and
important words, following word counting over the first 1000 nodes.
Colors are used in the following way: red for countries, orange
for biology, blue for mathematics and green for others. The top
panel shows the complex plane for positive imaginary part of the
eigenvalues, while the middle and bottom panels focus on the negative
and positive real parts. The top 20 nodes with largest values of
eigenstates $|\psi_i|$ and their eigenvalues $\lambda_i$ are given in Tables
4, 5, 6, 7 (4 names marked by dotted boxes in the figure panels).

The category of others contains rather diverse articles about poetry, Bible,
football, music, American TV series (e.g. Quantum Leap), and
small geographical places (e.g. Gaafu Alif Atoll). Clearly
these eigenstates select certain specific communities which
are relatively weakly coupled to the main bulk of
Wikipedia, which leads to a relatively large modulus $|\lambda_i|$.
The top 20 articles in the eigenstate index $K_i$ are
listed in Tables 4, 5, 6, 7.
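The word-frequency labeling described in this Section can be illustrated by a few lines of Python. This is a simplified sketch: the tokenization, the small stop-word list and the sample titles are assumptions made for the example and do not reproduce the exact procedure used for Fig. 6.

```python
import re
from collections import Counter

def most_frequent_word(titles, stop_words=frozenset({"of", "the", "in", "and"})):
    """Return the most frequent word among the article titles,
    ignoring case and a small list of stop words."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"[a-z]+", title.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts.most_common(1)[0][0]

# hypothetical sample standing in for the top articles of one eigenvector
sample_titles = [
    "Gaafu Alif Atoll",
    "Gaafu Dhaalu Atoll",
    "Atolls of the Maldives",
    "Huvadhu Atoll",
]
label = most_frequent_word(sample_titles)
```

In practice one would pass the titles of the first 1000 nodes in the index $K_i$ of a given eigenvector.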

The eigenvector of Table 4 has a positive real $\lambda$ and
is linked to the main article Gaafu Alif Atoll, which in
turn is linked mainly to atolls in this region. Clearly this
case represents a well localized community of articles mainly
linked between themselves, which gives a slow relaxation rate
of this eigenmode with $\lambda = 0.9772$ being rather close to
unity.

In Table 5 we have an eigenvector with real negative
eigenvalue $\lambda = -0.8165$ with the top node Photoactivatable
fluorescent protein. This node is linked to Kaede (protein)
and Eos (protein), with the latter being isolated from
coral. Its picture is listed in Portal:Berkshire/Selected picture,
which has pictures of St Paul's Cathedral and Legoland
Windsor; this explains the appearance of these, at first
glance unrelated, articles in this eigenvector.
Thus, this eigenvector also highlights a specific community
which is somewhat more strongly coupled to the global
Wikipedia core, due to a link to selected pictures, with a
smaller modulus of $\lambda$ compared to the case of Table 4.

The eigenvector of Table 6 has a complex eigenvalue
with $|\lambda| = 0.3733$ and the top article Portal:Bible. The
top three articles of this eigenvector have very close values
of $|\psi_i(j)|$, which seems to be the reason why we have
$\varphi = \arg(\lambda_i) = 0.3496\,\pi$, very close to $\pi/3$. The
Bible is strongly linked to various aspects of human society,
which leads to a relatively small modulus for this
well defined community.

In Table 7 we have an eigenvector which starts from
the article Lower Austria with the eigenvalue modulus
$|\lambda| = 0.3869$. This article is linked to such articles as Austria
and Upper Austria, with historical links to Styria. It
also links to the city of Krems an der Donau. The
articles World War II and Jew appear due to the sentence
"Before World War II, Lower Austria had the largest number
of Jews in Austria." Due to links with very popular
nodes the eigenvector of this community has a relatively
small modulus of $\lambda$.

The above analysis shows that the eigenvectors of the
Google matrix of Wikipedia clearly identify certain communities
which are relatively weakly connected with the
Wikipedia core when the modulus of the corresponding eigenvalue
is close to unity. For moderate values of $|\lambda|$ we still
have well defined communities, which however have
stronger links with some popular articles (e.g. countries),
which leads to a more rapid decay of such eigenmodes.

The above results show that the analysis of eigenvectors
highlights interesting features of communities and
network structure. However, a priori it is not evident how
the numerically obtained eigenvectors correspond to the
specific community features in which one may have
a particular interest. It is possible that for a well
defined community it can be useful to construct a personalized
Google matrix (see e.g. [5]) and to perform an analysis
of its eigenstates.



5 CheiRank versus PageRank plane 

As discussed in [18], [15], [16], [19], it is useful to look at
the distribution of network nodes on the PageRank-CheiRank
plane $(K, K^*)$. For Wikipedia a large scale distribution is
analyzed in [?, 19], and the networks of British Universities,
Linux Kernel and Twitter are considered in [19] and [15].

In Fig. 7 we show for Wikipedia the distribution of
nodes in the $(K, K^*)$ plane for a relatively small range of the top
5000 values of $K, K^*$. All directed links in this region are
also shown. In fact the number of such links and the number
of nodes in this region are relatively small. Indeed, a
large scale density of nodes (see Fig. 3 in [IB]) shows that
the density of nodes is not very high at the top corner of the
PageRank-CheiRank plane. This happens because
the top nodes of PageRank, whose components are proportional
to the number of ingoing links, are usually not
those of CheiRank, whose components are proportional to
the number of outgoing links.

The correlation between the PageRank and CheiRank vectors
can be characterized by their correlator [18], [19]:

$$\kappa = N \sum_{i=1}^{N} P(K(i))\, P^*(K^*(i)) - 1 \qquad (3)$$




Fig. 7. Top 5000 values in the PageRank-CheiRank plane $(K, K^*)$
of Wikipedia. All nodes and all links in this region are shown
by black circles and red arrows respectively.

Fig. 8. Density of nodes $W(K,K^*)$ on the PageRank-CheiRank
plane $(K, K^*)$ for the networks of BBC (left panels) and Le
Monde (right panels). Top panels show the density in the range
$1 \le K, K^* \le 10^4$ with averaging over cells of size 100 x 100;
bottom panels show the density averaged over 100 x 100 logarithmically
equidistant grids for $0 \le \ln K, \ln K^* \le \ln N$; the density
is averaged over all nodes inside each cell of the grid, and the
normalization condition is $\sum_{K,K^*} W(K,K^*) = 1$. Color varies
from blue at zero value to red at maximal density value. In
each panel the x-axis corresponds to $K$ (or $\ln K$ for the bottom
panels) and the y-axis to $K^*$ (or $\ln K^*$ for the bottom
panels).



For our networks we find its values to be $\kappa = 4.08$
(Wikipedia), 41.5 (Cambridge 2011), 12.9 (Python), 140.2
(BBC), 0.85 (Le Monde). Except for the case of Le Monde,
these values are relatively high, showing that there is a
significant correlation between PageRank and CheiRank
probabilities on the corresponding networks. We recall that
for Linux Kernel networks the values of $\kappa$ are close to zero,
corresponding to absence of correlations there [T5l[T§].
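The correlator of Eq. (3) is a sum over nodes of the product of the PageRank and CheiRank probabilities of the same node. The sketch below evaluates it for small artificial vectors, which are assumptions chosen only to show the sign and scale of $\kappa$.

```python
import numpy as np

def correlator(P, P_star):
    """kappa = N * sum_i P(i) * P_star(i) - 1, with both vectors
    indexed by node and normalized to unit sum."""
    P = np.asarray(P, dtype=float)
    P_star = np.asarray(P_star, dtype=float)
    return len(P) * np.dot(P, P_star) - 1.0

# artificial 4-node examples
uniform = np.full(4, 0.25)
peaked = np.array([0.7, 0.1, 0.1, 0.1])

kappa_uncorrelated = correlator(uniform, peaked)  # one vector flat
kappa_same_top = correlator(peaked, peaked)       # same node on top of both ranks
kappa_opposite = correlator(peaked, peaked[::-1]) # top nodes differ
```

A flat vector gives $\kappa = 0$, coinciding top nodes give a positive value, and different top nodes give a negative one.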

The strong difference between the $\kappa$ values for BBC and
Le Monde shows that the structure of these two web sites
is very different. To analyze this difference in more detail
we show the density of nodes for these two networks on
small and large scales in Fig. 8. For the small scale, shown in the
top panels, it is clear that the density of nodes is significantly
larger for the BBC network. However, this difference
becomes even more drastic on the large logarithmic scale
of the whole network shown in the bottom panels. Indeed,
on a logarithmic scale we see that the BBC network has a
square-like distribution region with a certain probability
maximum around the diagonal $K \approx K^*$, while the Le Monde
network has a triangular type of distribution which is typical
for networks without correlations between PageRank and
CheiRank vectors, as is the case for the Linux Kernel
networks (see Fig. 4 in [IH]). Indeed, a random procedure
of node generation on the $(K,K^*)$ plane gives such a triangular
distribution without correlations between PageRank
and CheiRank nodes (see the procedure description and right
panel of Fig. 4 in [IS]). This analysis shows that the BBC and
Le Monde agencies handle information flows on their web
sites in a drastically different manner. For the BBC
web site the most popular articles are at the same time
also the most communicative ones, while for the
Le Monde web site the most popular and
most communicative articles are very different.



6 Links distribution over PageRank nodes 

To understand the properties of directional flow on a network
it is also useful to analyze the distribution of links
over PageRank nodes. We illustrate this approach for the
Wikipedia network. Suppose that all nodes are ordered
in decreasing order of the modulus of a given eigenvector.
For the PageRank vector all nodes are numbered by the
PageRank index $K$, while for a given eigenstate $\psi_i(j)$ all
nodes are numbered by a corresponding local index $K_i$. We
now divide all nodes into two parts $A$ and $B$, with nodes
$1, \ldots, K_i$ in $A$ and nodes $K_i + 1, \ldots, N$ in $B$. Then we
determine the number of links $N_{AA}$ starting and ending in
part $A$, the number of links $N_{AB}$ pointing from part $A$ to
part $B$ and the number of links $N_{BA}$ pointing from part
$B$ to part $A$. The number of links inside part $B$ is then
$N_{BB} = N_\ell - N_{AA} - N_{AB} - N_{BA}$. For the PageRank vector
the dependence of $N_{AA}$ on $K$ was analyzed for different
networks in [TS]. Here we generalize this concept to consider
links between the two parts $A$, $B$ for various eigenvectors
of the Google matrix.
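The counting of links between the two parts can be sketched as follows. The function name `link_counts`, the edge-list representation and the small example network are assumptions made for this illustration.

```python
import numpy as np

def link_counts(edges, rank, K_cut):
    """Count links inside and between A = {nodes with rank <= K_cut} and
    B = {all other nodes}. `edges` has rows (source, target) and
    rank[i] is the 1-based local index K_i of node i."""
    edges = np.asarray(edges)
    rank = np.asarray(rank)
    src_in_A = rank[edges[:, 0]] <= K_cut
    dst_in_A = rank[edges[:, 1]] <= K_cut
    N_AA = int(np.sum(src_in_A & dst_in_A))
    N_AB = int(np.sum(src_in_A & ~dst_in_A))
    N_BA = int(np.sum(~src_in_A & dst_in_A))
    N_BB = len(edges) - N_AA - N_AB - N_BA  # total number of links is conserved
    return N_AA, N_AB, N_BA, N_BB

# hypothetical 4-node example; node i has rank i + 1, so A = {0, 1} for K_cut = 2
edges = [(0, 1), (1, 0), (2, 0), (0, 3), (3, 2)]
rank = [1, 2, 3, 4]
counts = link_counts(edges, rank, K_cut=2)
```

Scanning `K_cut` from 1 to $N$ gives the curves $N_{AA}(K_i)$, $N_{AB}(K_i)$ and $N_{BA}(K_i)$ for a given eigenvector.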

According to the data of Fig. 9 we find that for all
eigenvectors $N_{AA}$ grows with $K_i$ approximately in an algebraic
way with an exponent close to 1.5,
similar to the PageRank case considered in [TS].

Fig. 9. Number of links between or inside sets $A$ and $B$ defined
by the index $K_i$ ordered by decreasing absolute value
of Wikipedia eigenstates. The number of links starting from and
pointing to nodes inside the set $A$ ($N_{AA}$) is shown in the top
panel as a function of $K_i$. The cases of links from set $A$ to
set $B$ ($N_{AB}$) and from $B$ to $A$ ($N_{BA}$) are shown in the middle
and bottom panels respectively. Note that the total number
of links is conserved and the quantity $N_{BB}$ can be obtained
as $N_{BB} = N_\ell - N_{AA} - N_{AB} - N_{BA}$. The case of the PageRank
vector with damping parameter $\alpha = 0.85$ is shown by
a black curve versus the $K$ index. The color curves show the
cases of four core space eigenvectors $|\psi_i|$ of $S$ versus their
ranking indexes $K_i$. Red and green curves are the eigenvectors
corresponding to the two largest core space eigenvalues
(in modulus), being $\lambda_1 = 0.99998702$ and $\lambda_2 = 0.97723699$
respectively; blue and pink curves are the eigenvectors corresponding
to two complex eigenvalues with large imaginary
part, being $\lambda_{52} = -0.3503316 + i\,0.77373677$ and $\lambda_{864} =
-0.34293502 + i\,0.43144930$ respectively.

However,
the dependence of $N_{AB}$ and $N_{BA}$ on $K_i$ is rather different
for different eigenstates. For the PageRank and the $\lambda_1$
eigenvector we find practically the same behavior, linked
to the fact that at $\alpha = 0.85$ the PageRank vector is rather
close to the first core space eigenvector (see the discussion in
[15]). Here the interesting point is that at small values
of $K_i$ we have $N_{BA}$ larger than $N_{AB}$ by almost a
factor of 100. This is due to the fact that low rank nodes
at large $K_i$ point preferentially to high rank nodes at low
$K_i$. For the other three eigenvectors with $\lambda_2$, $\lambda_{52}$, $\lambda_{864}$ we
find a well pronounced step-like behavior of $N_{AB}$, $N_{BA}$ as a function of
$K_i$. We argue that the step size in $K_i$ is given by the size
of a community which has preferential links mainly inside
the community. Indeed, for the eigenvector of $\lambda_2$ (see Table 3)
we see that the community size, estimated as $1/|\psi_i|$, is
approximately 100, which corresponds to the step size
$K_i \approx 70$ for this case.

These results show that the analysis of the link distribution
over the PageRank index provides interesting and
useful information about characteristics and properties of
directed networks.



7 Discussion 

In this work we performed a spectral analysis of the eigenvalues
and eigenstates of the Google matrix of Wikipedia
and other networks. Our study shows that the spectrum of
the core space component has eigenvalues in a close vicinity
of $\lambda = 1$ and that there are isolated subspaces which
give a degeneracy of the eigenvalue $\lambda = 1$. The eigenvalues
and eigenstates with relatively large values of $|\lambda|$ can
be efficiently determined by the powerful Arnoldi method.
These eigenstates are mainly located on well defined network
communities. We also find that the spectrum changes
drastically from one network to another even if the distribution
of links and the decay of PageRank are rather similar
for the networks considered. This means that the properties
of directed networks strongly depend on the internal
network structure. We show that the correlation between
PageRank and CheiRank vectors highlights specific properties
of information flow on directed networks. For example,
this correlation demonstrates a drastic difference between
the web sites of BBC and Le Monde. The distribution
of links between PageRank nodes also provides interesting
information about the network structure. On the
basis of our studies we argue that the developed spectral
analysis of the Google matrix brings a deeper understanding
of information flow on real directed networks.
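
As an illustration of the PageRank and CheiRank correlation mentioned above, the following minimal sketch computes both vectors by power iteration and evaluates a correlator. It assumes the definition $\kappa = N \sum_i P(i) P^*(i) - 1$, which is common in the CheiRank literature but is not restated in this section, together with a damping factor $\alpha = 0.85$. The function names and the two toy networks are hypothetical.

```python
def pagerank(links, n, alpha=0.85, iterations=200):
    """PageRank by power iteration; links is a list of (source, target) pairs."""
    out = [[] for _ in range(n)]
    for src, dst in links:
        out[src].append(dst)
    p = [1.0 / n] * n
    for _ in range(iterations):
        # Probability on dangling nodes (no out-links) is spread uniformly.
        dangling = sum(p[i] for i in range(n) if not out[i])
        new = [(1.0 - alpha) / n + alpha * dangling / n] * n
        for i in range(n):
            if out[i]:
                share = alpha * p[i] / len(out[i])
                for j in out[i]:
                    new[j] += share
        p = new
    return p


def cheirank(links, n, alpha=0.85, iterations=200):
    """CheiRank: PageRank of the network with all link directions inverted."""
    return pagerank([(dst, src) for src, dst in links], n, alpha, iterations)


def correlator(links, n, alpha=0.85):
    """Assumed definition: kappa = N * sum_i P(i) * P_star(i) - 1."""
    p = pagerank(links, n, alpha)
    p_star = cheirank(links, n, alpha)
    return n * sum(a * b for a, b in zip(p, p_star)) - 1.0


# Directed ring: every node has one in-link and one out-link, so both
# vectors are uniform and the correlator vanishes.
ring = [(i, (i + 1) % 6) for i in range(6)]
kappa_ring = correlator(ring, 6)

# Star with links in both directions: the hub leads both rankings,
# which gives a positive correlator.
star = [(0, i) for i in range(1, 6)] + [(i, 0) for i in range(1, 6)]
kappa_star = correlator(star, 6)
```

The two toy networks only mark the limiting behavior: no correlation when both rankings are flat, and a positive correlation when the same node dominates incoming and outgoing links.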





              $N$        $N_\ell$    $n_A$
Wikipedia     3282257    71012307    3000
Cam. 2011      893176    15106706    4000
Python         541545     9031262    5000
BBC            319637     7278258    4000
Le Monde       134196    10621445    5000



Table 1. Parameters of all networks considered in the paper. 
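
The Arnoldi dimensions of a few thousand listed in Table 1 are far smaller than the network sizes, which is what makes the method practical for the leading part of the spectrum. The sketch below shows the idea on a small random directed network, assuming NumPy is available. It is not the authors' implementation: it is applied to a full Google matrix at $\alpha = 0.85$ rather than to the core space, and the function names, the network and the seeds are hypothetical.

```python
import numpy as np


def google_matrix(adjacency, alpha=0.85):
    """G = alpha*S + (1-alpha)/N, with adjacency[i, j] = 1 if node j links to node i."""
    n = adjacency.shape[0]
    out_degree = adjacency.sum(axis=0)
    safe = np.where(out_degree > 0, out_degree, 1.0)
    # Columns of dangling nodes (no out-links) are replaced by the uniform value 1/N.
    S = np.where(out_degree > 0, adjacency / safe, 1.0 / n)
    return alpha * S + (1.0 - alpha) / n


def arnoldi_ritz_values(matrix, n_arnoldi, seed=0):
    """Ritz values from n_arnoldi Arnoldi steps, sorted by decreasing modulus."""
    n = matrix.shape[0]
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, n_arnoldi + 1))          # orthonormal Krylov basis
    H = np.zeros((n_arnoldi + 1, n_arnoldi))  # Hessenberg projection of the matrix
    start = rng.random(n)
    Q[:, 0] = start / np.linalg.norm(start)
    steps = n_arnoldi
    for k in range(n_arnoldi):
        w = matrix @ Q[:, k]
        for j in range(k + 1):                # modified Gram-Schmidt
            H[j, k] = Q[:, j] @ w
            w = w - H[j, k] * Q[:, j]
        H[k + 1, k] = np.linalg.norm(w)
        if H[k + 1, k] < 1e-12:               # Krylov space became invariant
            steps = k + 1
            break
        Q[:, k + 1] = w / H[k + 1, k]
    ritz = np.linalg.eigvals(H[:steps, :steps])
    return ritz[np.argsort(-np.abs(ritz))]


# Small random directed network: 200 nodes, Arnoldi dimension 40.
rng = np.random.default_rng(1)
adjacency = (rng.random((200, 200)) < 0.03).astype(float)
np.fill_diagonal(adjacency, 0.0)
G = google_matrix(adjacency)
leading = arnoldi_ritz_values(G, n_arnoldi=40)[:5]
```

Only products of the matrix with a vector are needed, so for a real network the dense array would be replaced by a sparse representation of the links.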





               $N_s$     $N_d$    $d_{\max}$   $N_{\rm circ.}$   $N_1$
Wikipedia        515       255        11            381            255
Wikipedia*     21198      5355       717           8968           5365
Cam. 2011        808       329        74            343            332
Cam. 2011*    186062      2039      5144           2044           2041
Python           198        23        72             26             23
Python*         1589        25       951             35             31
BBC               50        19        28             19             19
BBC*              39        28         6             28             28
Le Monde          83        64        18             64             64
Le Monde*        789       354        15            373            361


Table 2. $G$ and $G^*$ eigenspectrum parameters for all networks.
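
The Table 2 entries can be checked against the relations stated in the Appendix, namely the average subspace dimension $\langle d \rangle = N_s/N_d$ and the ordering of $N_s$, $N_{\rm circ.}$, $N_1$, $N_d$ and $d_{\max}$. A minimal sketch follows. It assumes that the ordering relations are non-strict, since several rows show equal values, and that the $d_{\max}$ entry for Wikipedia reads 11. The variable and function names are hypothetical.

```python
# Rows of Table 2 as (N_s, N_d, d_max, N_circ, N_1); a star marks G*.
# Assumption: the d_max entry for Wikipedia is read as 11.
TABLE_2 = {
    "Wikipedia": (515, 255, 11, 381, 255),
    "Wikipedia*": (21198, 5355, 717, 8968, 5365),
    "Cam. 2011": (808, 329, 74, 343, 332),
    "Cam. 2011*": (186062, 2039, 5144, 2044, 2041),
    "Python": (198, 23, 72, 26, 23),
    "Python*": (1589, 25, 951, 35, 31),
    "BBC": (50, 19, 28, 19, 19),
    "BBC*": (39, 28, 6, 28, 28),
    "Le Monde": (83, 64, 18, 64, 64),
    "Le Monde*": (789, 354, 15, 373, 361),
}


def average_subspace_dimension(n_s, n_d):
    """Average subspace dimension <d> = N_s / N_d."""
    return n_s / n_d


def satisfies_relations(n_s, n_d, d_max, n_circ, n_1):
    """Check N_s >= N_circ >= N_1 >= N_d and N_s >= d_max."""
    return n_s >= n_circ >= n_1 >= n_d and n_s >= d_max


averages = {
    name: average_subspace_dimension(row[0], row[1])
    for name, row in TABLE_2.items()
}
all_consistent = all(satisfies_relations(*row) for row in TABLE_2.values())
```

For Wikipedia, for instance, this gives $\langle d \rangle = 515/255 \approx 2.0$, so most invariant subspaces of $G$ contain only a couple of nodes.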



9 Appendix

We give here tables with all network parameters used in
the paper. The notations used in the tables are: $N$ is the network
size, $N_\ell$ is the number of links, $n_A$ is the Arnoldi
dimension used for the Arnoldi method for the core space
eigenvalues, $N_s$ is the total number of nodes in the invariant
subspaces, $N_d$ is the number of invariant subspaces, $d_{\max}$
gives the maximal subspace dimension, $N_{\rm circ.}$ denotes the number
of eigenvalues on the unit circle with $|\lambda_j| = 1$, and $N_1$ denotes
the number of unit eigenvalues with $\lambda_j = 1$. We remark that
$N_s \ge N_{\rm circ.} \ge N_1 \ge N_d$ and $N_s \ge d_{\max}$, and that the
average subspace dimension is given by $\langle d \rangle = N_s/N_d$. We
note that the values of $N$, $N_\ell$ for the network of Cambridge
2011 are slightly different from those given in [TH] due to a
slightly different procedure of cleaning of the raw data collection
(e.g. the count of pdf and other types of nodes). Eigenvalues
for eigenvectors are shown in Fig. 1 with the colors red,
green, blue or pink corresponding to the colors of Table 3. The
index $m$ of $\lambda_m$ in Tables 3,4,5,6,7 counts the order number
of core eigenvalues in decreasing order of $|\lambda_m|$.

Rank   Node                                $|\psi_i|$
1      Gaafu Alif Atoll                    …
2      …                                   0.00812
3      Hithaadhoo (Gaafu Alif Atoll)       0.00808
4      Dhigurah (Gaafu Alif Atoll)         0.0080…
5      Maarandhoo (Gaafu Alif Atoll)       0.00806
6      Hulhimendhoo (Gaafu Alif Atoll)     0.0080…
7      Araigaiththaa                       0.00798
8      Baavandhoo                          0.00798
9      Baberaahuttaa                       …
10     Bakeiththaa                         0.00798
11     …                                   …


Table 4. Node rank for decreasing modulus of eigenstate $|\psi_i|$
corresponding to the eigenvalue $\lambda_2 = 0.97724$ (see Fig. 6).





Rank   $\lambda_{80} = -0.8165$ ("protein")        $|\psi_i|$
1      Photoactivatable fluorescent protein        0.22767
2      Kaede (protein)                             0.13942
3      Eos (protein)                               0.13942
4      Fusion protein                              …
…
17     …                                           …
18     …                                           …
…      …                                           0.00177


Table 5. Node rank for decreasing modulus of eigenstate $|\psi_i|$
corresponding to the eigenvalue $\lambda_{80} = -0.8165$ (see Fig[S]).

Table 6. Node rank for decreasing modulus of eigenstate $|\psi_i|$
corresponding to the eigenvalue $\lambda_{1481} = 0.1699 + i\,0.3325$ (see
Fig©).